FOREWORD


Opinions, interpretations, conclusions and recommendations are
those of the author and are not necessarily endorsed by the U.S.
Army.


Where  copyrighted  material  is  quoted,  permission  has  been 
obtained  to  use  such  material. 

Where material from documents designated for limited
distribution is quoted, permission has been obtained to use the
material.


Citations of commercial organizations and trade names in
this  report  do  not  constitute  an  official  Department  of  Army 
endorsement  or  approval  of  the  products  or  services  of  these 
organizations. 


In conducting research using animals, the investigator(s)
adhered to the "Guide for the Care and Use of Laboratory
Animals," prepared by the Committee on Care and Use of Laboratory
Animals of the Institute of Laboratory Resources, National
Research Council (NIH Publication No. 86-23, Revised 1985).

For the protection of human subjects, the investigator(s)
adhered to policies of applicable Federal Law 45 CFR 46.

In conducting research utilizing recombinant DNA technology,
the investigator(s) adhered to current guidelines promulgated by
the National Institutes of Health.

In the conduct of research utilizing recombinant DNA, the
investigator(s) adhered to the NIH Guidelines for Research
Involving Recombinant DNA Molecules.

In the conduct of research involving hazardous organisms,
the investigator(s) adhered to the CDC-NIH Guide for Biosafety in
Microbiological and Biomedical Laboratories.

Computer Aided Diagnosis of Breast Cancer: A Multi-Center Demonstration.

PI:  Carey  E.  Floyd  Jr. 

Statement  of  Work 

(months  1-36) 

1) Acquire diagnostic mammography cases from mammography providers
distributed over a wide geographical area using the BI-RADS™ findings
reporting criteria.

(months 1-6) Develop tools for managing the database and generating reports.
Cases will be acquired from each site and entered into the database as a continual
effort.

(months  1-36) 

2) Test the existing CAD system on biopsy cases from other mammographic
facilities (external to Duke). This testing will be performed on a monthly
schedule. The results will be summarized at the end of the first six months
and periodically through the project.

3) Develop an ANN to predict biopsy outcome from BI-RADS™ mammographic
and history findings for the individual and combined datasets from other
mammographic facilities.

(months  1-6)  Develop  tools  for  importing  cases  from  the  database  into  the 
artificial  neural  network  systems. 

(months 6-12) Refine the coding of the ANNs to facilitate use with large datasets.

(months 6-36) Examine the behavior of the different training techniques:
cross-validation, bootstrap, and round-robin as the datasets grow in size.

4) Evaluate the difference between the individual and combined networks.
(months  6-36)  This  work  will  begin  in  the  first  year  as  the  data  and  tools  become 
available.  It  will  continue  throughout  the  project. 

Progress  in  the  second  period  (months  12-24) 

(months  1-36) 

1) Acquire diagnostic mammography cases from mammography providers distributed over
a wide geographical area using the BI-RADS™ findings reporting criteria.

Progress  has  been  made  toward  this  aim  and  the  progress  is  on  target. 
Specifically, 

In  the  second  year  of  this  project  we  have: 

1  Ported  the  database  from  the  FOXPRO  database  language  into  ACCESS 
since  the  commercial  support  for  FOXPRO  has  diminished. 

2  Searched  the  Tumor  Registry  of  the  Duke  University  Medical  Center 
Comprehensive  Cancer  Center  to  attempt  to  find  cases  initially  read  as  benign 
that  turned  out  to  be  malignant  from  the  700  cases  from  Duke. 

3  Acquired 500 cases from Sloan-Kettering.

4  Acquired 500 cases from the University of Maryland.

Cases  will  be  acquired  from  each  site  and  entered  into  the  database  as  a  continual  effort. 
(months  1-36) 

These cases have been entered into the database and have been examined for
completeness and accuracy. About 8% of the records were contradictory or
incomplete, and a portion of these were recovered after follow-up with the
contributing sites.

2) Test the existing CAD system on biopsy cases from other mammographic
facilities (external to Duke). This testing will be performed on a monthly
schedule. The results will be summarized at the end of the first six months
and periodically through the project.

This  work  has  been  performed  for  the  1000  cases  from  Penn  and  is  reported 
below. 


3) Develop an ANN to predict biopsy outcome from BI-RADS™ mammographic and
history findings for the individual and combined datasets from other mammographic
facilities.


Done  (Reported  below) 

(months  1-6)  Develop  tools  for  importing  cases  from  the  database  into  the  artificial  neural 
network  systems. 

(months 6-12) Refine the coding of the ANNs to facilitate use with large datasets.

(months 6-36) Examine the behavior of the different training techniques: cross-validation,
bootstrap, and round-robin as the datasets grow in size.

4) Evaluate the difference between the individual and combined networks.

(months  6-36)  This  work  will  begin  in  the  first  year  as  the  data  and  tools  become  available. 
It  will  continue  throughout  the  project. 




Progress report for DAMD17-96-1-6226


Publication 

In  the  current  period,  we  published  1  manuscript  and  5  abstracts  describing 
work  funded  in  whole  or  in  part  by  this  grant. 

Body  of  Report 

We  describe  an  Artificial  Neural  Network  (ANN)  approach  to  computer 
aided  diagnosis  of  breast  cancer  from  mammographic  findings.  An  ANN  has 
been  developed  to  provide  support  for  the  clinical  decision  to  perform  breast 
biopsy.  The  system  is  designed  to  aid  in  the  decision  to  biopsy  those  patients 
who  have  suspicious  mammographic  findings.  The  decision  to  biopsy  can  be 
viewed as a two-stage process: 1) the mammographer views the mammogram
and  determines  the  presence  or  absence  of  image  features  such  as  calcifications 
and  masses,  2)  the  presence  and  description  of  these  features  and  the  patient's 
medical  history  are  merged  to  form  a  diagnosis.  The  ANN  system  is  an  aid  to  the 
second  step  and  is  motivated  by  the  large  fraction  of  biopsies  that  are  benign. 

While mammography is a sensitive procedure for detecting breast cancer,
the positive predictive value (PPV) is low. Only 10-34% of women who undergo
biopsy for mammographically suspicious nonpalpable lesions actually are found
to have malignancy (Kopans 1992). Between 0.5% and 2.0% of all mammographic
exams result in biopsy; several hundreds of thousands of biopsies are performed
on benign lesions each year. The women undergoing biopsy for a benign finding
are unnecessarily subjected to the discomfort, expense, potential complications,
change in cosmetic appearance, and anxiety that can accompany breast
biopsy (Helvie, Ikeda et al. 1991; Dixon and John 1992; Kopans 1992; Schwartz,
Carter et al. 1994). In addition, the financial burden of these procedures
(between $3000 and $5000 per biopsy) is significant in the present political and
economic effort to reduce expenditures. Our system may significantly improve
the PPV through an ANN approach that utilizes a large database of
cases with known outcomes. In clinical practice, this system can be easily
integrated into the mammographers' work flow through a computerized
reporting system. The clinician reads a mammogram and records the findings
into a computer using a standard reporting lexicon (BI-RADS™). The categorical
findings for the case are encoded as numerical values and are presented to the
ANN as inputs. The ANN produces an output that is associated with the
likelihood of malignancy. This output is referred to as the malignancy fraction
and is an intuitive response that the woman's health care team can then include
in the medical decision for biopsy.
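The encoding step described above can be illustrated with a minimal sketch. The category lists, their ordering, the field names, and the scaling to [0, 1] are all assumptions made for illustration; the report does not specify the coding scheme actually used.

```python
# Hypothetical encoding of categorical BI-RADS findings as numeric ANN inputs.
# The category order and the [0, 1] scaling are illustrative assumptions,
# not the coding scheme used in this project.

MASS_MARGIN = ["none", "circumscribed", "microlobulated",
               "obscured", "indistinct", "spiculated"]
MASS_SHAPE = ["none", "round", "oval", "lobular", "irregular"]


def encode_ordinal(value, categories):
    """Map a category name to an evenly spaced value in [0, 1]."""
    index = categories.index(value)  # raises ValueError for unknown terms
    return index / (len(categories) - 1)


def encode_case(case):
    """Turn one case (a dict of findings) into a list of numeric inputs."""
    return [
        encode_ordinal(case["mass_margin"], MASS_MARGIN),
        encode_ordinal(case["mass_shape"], MASS_SHAPE),
        min(case["age"], 100) / 100.0,  # scale age in years to [0, 1]
    ]


example = {"mass_margin": "spiculated", "mass_shape": "irregular", "age": 62}
print(encode_case(example))  # [1.0, 1.0, 0.62]
```

A real system would cover all ten descriptors plus age; only three inputs are shown here to keep the sketch short.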

In  this  report  we  describe  in  detail  the  comparison  of  the  model  developed 
on  one  dataset  and  evaluated  on  another.  This  was  the  primary  goal  of  the 
project. 


Previous studies at Duke University Medical Center resulted in a
computer model to predict breast lesion malignancy based upon BI-RADS
mammographic features and the patient age (Floyd, Lo et al. 1994; Baker,
Kornguth et al. 1995; Baker, Kornguth et al. 1995; Floyd, Lo et al. 1995; Lo, Baker
et al. 1995; Lo, Baker et al. 1995; Lo, Baker et al. 1995; Lo, Baydush et al. 1995; Lo,
Grisson et al. 1995; Baker, Kornguth et al. 1996; Lo, Baker et al. 1996; Lo and
Floyd 1996; Lo, Kim et al. 1996; Lo, Baker et al. 1997; Lo, Baker et al. 1997; Lo and
Floyd 1997; Lo, Baker et al. 1998). This study evaluates performance of this
artificial neural network (ANN) model for cases from an independent institution,
the University of Pennsylvania.

At  both  institutions,  consecutive  cases  of  nonpalpable  breast  lesions  which 
underwent  excisional  biopsy  were  selected,  resulting  in  a  set  of  500  cases  from 
Duke  and  1000  cases  from  Pennsylvania.  For  each  lesion,  ten  BI-RADS 
descriptors  and  the  patient  age  were  recorded  by  expert  mammographers.  An 
original  ANN  was  trained  and  tested  on  the  Duke  cases  to  predict  biopsy 
outcome. With no further adaptation, this ANN was then evaluated on the
Pennsylvania cases. The hypothesis was that a network trained on cases from
one institution could generalize and accurately predict the outcomes for cases
from another institution. To test this hypothesis, the ANN that had been trained
on the Duke cases was evaluated on the cases from Penn. For comparison,
another ANN was trained and tested on the Penn data alone. The performance of
these three evaluations is presented below.
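As an illustration of how a frozen network's outputs on an external site can be summarized by ROC area, the following sketch computes the nonparametric (rank-based) area. This is an assumption about method: the Az values in this report may have been obtained with a fitted ROC model instead, and the scores below are toy values, not project data.

```python
def roc_area(scores, labels):
    """Nonparametric ROC area: the probability that a malignant case
    receives a higher output than a benign case (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # malignant
    neg = [s for s, y in zip(scores, labels) if y == 0]  # benign
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy outputs of a network trained at one site and applied, unchanged,
# to cases from another site (1 = malignant, 0 = benign).
external_scores = [0.9, 0.8, 0.35, 0.6, 0.3, 0.2, 0.1]
external_labels = [1, 1, 1, 0, 0, 0, 0]
print(roc_area(external_scores, external_labels))
```

The pairwise loop is quadratic in the number of cases, which is acceptable for datasets of the size discussed here.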

The ANN that was trained and tested on the Duke cases alone performed
with ROC area of 0.86 ± 0.02. The ROC curve for this network is shown as the
solid line in Fig. 1. The ANN that was trained and tested on the Penn cases alone
performed with ROC area of 0.82 ± 0.02. The ROC curve for this network is
shown as the long dashed curve in Fig. 1. These results suggest that the
Pennsylvania cases alone are more challenging to describe with an ANN model
than the Duke cases. When the network trained on Duke cases was evaluated on
the Penn cases, an ROC area of 0.79 ± 0.01 was obtained. This curve is plotted as
the short dashed curve in Fig. 1.

Network Performance From Multiple Sites


Fig.  1  ROC  comparison  of  the  three  ANN  evaluations. 

While ROC area is the most common criterion for comparing two
diagnostic systems, using this criterion assumes an equal "cost" for
misclassifying a positive and a negative case. For the medical decision of whether to
biopsy a suspicious region in a breast, the cost of missing a true cancer is higher
than the cost of performing a biopsy on a benign region. While a cost-benefit
analysis is the best technique for evaluating such a problem, this is beyond the
scope of this project. In clinical practice, no decision aid will be accepted that
performs with less than a high sensitivity. The performance of the ANNs was
therefore evaluated by comparing the specificity at a high fixed level of sensitivity
of 98%. The performance of the three systems is compared in Table 1, which shows
the ROC area (Az), the specificity at 98% sensitivity, the positive predictive value
(PPV), the number of malignancies missed and the number of benign biopsies
saved.
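The fixed-sensitivity comparison can be sketched as follows. This is a minimal illustration under stated assumptions: cases with an output at or above the threshold are sent to biopsy, the threshold is set at the highest value that still meets the target sensitivity, and the function name and toy scores are hypothetical.

```python
import math


def specificity_at_sensitivity(scores, labels, target_sensitivity=0.98):
    """Pick the highest threshold that keeps sensitivity at or above the
    target, then report (threshold, cancers missed, biopsies spared,
    specificity). Cases scoring >= threshold are sent to biopsy."""
    pos = sorted(s for s, y in zip(scores, labels) if y == 1)  # malignant
    neg = [s for s, y in zip(scores, labels) if y == 0]        # benign
    # Largest number of malignant cases allowed to fall below threshold;
    # the small epsilon guards against floating-point round-off.
    allowed_misses = math.floor((1.0 - target_sensitivity) * len(pos) + 1e-9)
    threshold = pos[allowed_misses]
    missed = sum(1 for s in pos if s < threshold)
    spared = sum(1 for s in neg if s < threshold)
    return threshold, missed, spared, spared / len(neg)


# Toy data: 10 malignant and 8 benign cases.
malignant = [0.15, 0.4, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8, 0.9, 0.95]
benign = [0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.65, 0.85]
scores = malignant + benign
labels = [1] * len(malignant) + [0] * len(benign)

# With only 10 malignant cases, a 90% target permits exactly one miss.
print(specificity_at_sensitivity(scores, labels, target_sensitivity=0.90))
# (0.4, 1, 5, 0.625)
```

With the default 98% target and so few malignant cases, no miss is permitted and the threshold falls to the lowest malignant score.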

Table 1  Comparison of the performance of the systems.

train / test   spec. at 98% sens   cancers missed   biopsies obviated   PPV   Az
Duke MDs       12%                 4 / 174          39 / 326            35%   0.82 ± 0.02
Duke / Duke    42%                 3 / 174          136 / 326           47%   0.87 ± 0.02
Penn / Penn    15%                 7 / 396          90 / 604            43%   0.82 ± 0.01
Duke / Penn    18%                 7 / 396          107 / 604           44%   0.79 ± 0.01
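The percentages for the three ANN rows of Table 1 can be reproduced from the reported counts. The sketch below assumes that PPV is computed over the cases still sent to biopsy, that is, true positives divided by true positives plus the benign biopsies not obviated; the report does not state this formula explicitly. The Duke MDs row is left out because its PPV does not follow from the same counts under this assumption.

```python
# Counts taken from Table 1: cancers missed / malignant cases and
# biopsies obviated / benign cases.
rows = {
    "Duke / Duke": {"missed": 3, "malignant": 174, "obviated": 136, "benign": 326},
    "Penn / Penn": {"missed": 7, "malignant": 396, "obviated": 90, "benign": 604},
    "Duke / Penn": {"missed": 7, "malignant": 396, "obviated": 107, "benign": 604},
}


def summarize(r):
    """Derive sensitivity, specificity and PPV from one row of counts."""
    true_pos = r["malignant"] - r["missed"]   # cancers still biopsied
    false_pos = r["benign"] - r["obviated"]   # benign lesions still biopsied
    return {
        "sensitivity": true_pos / r["malignant"],
        "specificity": r["obviated"] / r["benign"],
        "ppv": true_pos / (true_pos + false_pos),  # assumed definition
    }


for name, r in rows.items():
    s = summarize(r)
    print(f"{name}: sens {s['sensitivity']:.0%}, "
          f"spec {s['specificity']:.0%}, PPV {s['ppv']:.0%}")
# Duke / Duke: sens 98%, spec 42%, PPV 47%
# Penn / Penn: sens 98%, spec 15%, PPV 43%
# Duke / Penn: sens 98%, spec 18%, PPV 44%
```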

Row  1: 

The  "Duke  network"  (trained  on  Duke  500,  tested  on  Duke  500)  improved 
specificity  at  98%  sensitivity  over  Duke  MDs  dramatically,  from  12%  to  42%. 
There was also an improvement in PPV from 35% to 47% and in Az from 0.82 to
0.87  (p=0.08). 

Row  2: 

The "Penn network" (trained on Penn 1k, tested on Penn 1k), simulating the effect of
customizing an ANN just for Penn, showed much lower specificity (15%) and
somewhat lower PPV (43%) and Az (0.82) compared to the Duke net. This is all
consistent with the Penn data set being inherently more challenging, as noted
above.

Row  3: 

The "Cross network" (trained on Duke 500, tested on Penn 1k), simulating the effect
of cross-institution application, showed similarly poorer performance. In
particular, specificity was 18% and Az only 0.79. It should be noted, however, that
the performance was almost identical to that of the Penn net. Whether the
ANNs were trained on Duke or Penn cases, both performed equally poorly on
the Penn cases. In other words, the limiting factor may be the inherent difficulty
of the Penn cases, not the ANN's inability to generalize. This is encouraging
because nothing was gained by customizing the ANN specifically for Penn. If we
can learn how to characterize the Penn cases better, we can probably make an
ANN that will generalize better as well. Another encouraging observation: the
cross net maintained the same high PPV as the Duke net and Penn net (all in
the 40s). All 3 PPVs were much higher than that of the original Duke MDs.


CONCLUSION: The ANN that was trained on Duke cases alone generalized
successfully to a relatively large, independent data set. The performance was
comparable to or better than that of the radiologists at that institution, and only
slightly worse than a new ANN specifically optimized for the new cases. This
breast cancer prediction model thus shows potential to be applied in other
institutions which also utilize the standardized BI-RADS mammography lexicon,
and it may help reduce the number of unnecessary biopsies.